🧠 Inference Serving - emschwartz

Discussed on Hacker News

🔓Open Source AI Anyscale blog posts·

High Performance Distributed Inference with Ray Serve LLM

Covered by Google Cloud Blog

Discussed on Hacker News

🏗️LLM Infrastructure vettedconsumer.com·

The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)

Covers 2 stories including Efficient Memory Management for Large Language Model Serving with PagedAttention

Discussed on Hacker News

🏗️LLM Infrastructure GitHub·

Profile(v2.1.4) physics-aware optimizer for vLLM (31→470 tok/s on A100)

Discussed on Hacker News

🏗️LLM Infrastructure arxiv.org·

SwiftCache: Efficient LLM Serving for Multi-turn Conversations with Heterogeneous KV Cache Sharing

Less-relevant results

🔓Open Source AI mstar.stanford.edu·

M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models

Discussed on Hacker News

🏗️LLM Infrastructure Martin Alderson·

A brief history of KV cache compression developments

Covers TurboQuant: Redefining AI efficiency with extreme compression

🏗️LLM Infrastructure abhishek.it·

Running GLM-5.2 5x faster at 500tps with limitation

Discussed on Hacker News

🆕New AI huggingface.co·

225B-A23B

Covered by news.smol.ai

Discussed on r/LocalLLaMA

🤖AI devashish.me·

Two Qwen3 models on one DGX Spark: the residency math

Discussed on Hacker News

🏗️LLM Infrastructure Google Cloud Blog·

Scaling Ray Serve LLM on GKE: Performance without losing the developer experience

🧠LLM Inference arxiv.org·

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

🤖AI GitHub·

ahwurm/localharness: Model-agnostic agent harness for local LLMs — configure agents in YAML and run them on your own hardware (vLLM, Ollama, LM Studio, llama.cpp).

Covers uv

Discussed on Hacker News

🤖AI anbeeld.com·

7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context

Discussed on r/LocalLLaMA

🤖AI lmsys.org·

DFlash and Spec V2 Decoding (14 minute read)

Covers 5 stories including Looking for a self-hosted alternative to Modal.com for running ML workloads

Discussed on Hacker News

🏗️LLM Infrastructure arxiv.org·

SAC: Disaggregated KV Cache System for Sparse Attention LLMs with CXL

🤖AI mlx-optiq.com Video·

Mlx-optiq: per-layer mixed-precision LLM quantization for Apple Silicon

Covered by GitHub, Nitter

Discussed on Hacker News

🔓Open Source AI GitHub·

yifanfeng97/Hyper-Extract

Covered by 何夕2077的个人站

🏗️LLM Infrastructure Towards AI

Continuous Batching: How to Keep Your GPU Actually Busy

PagedAttention is more than virtual memory

“Running Local Models Is Good Now” Was Written on a 64GB Mac. Half of You Have 16GB or Less